Tidyverse is a set of R libraries that enables the best methods for Data Management. I will use the tidyverse libraries to perform cluster analysis and provide this information to other data science teams in the industry.
library(devtools)
install_github("kassambara/factoextra")
We need the ability to transform vectors in our data frames to standard variables. A standard variable is the 0 +|- 1 std. In other words, the variables need to be similar so that clustering algorithms can accurately determine the distance between each variable used in the algorithm. The solution is to standardize the variables using the scale() function.
Tidyverse has evolved over the last year and a half significantly since version 0.7 was released. A new way to mutate columns in a data.frame is to use the function mutate_at(). This function allows you to programatically create a solution to transform specific columns into a different format.
library(dplyr)
set.seed(1234)
dat <- data.frame(x = rnorm(20, 30, .2),
y = runif(20, 3, 5),
z = runif(20, 10, 20))
head(dat)
## x y z
## 1 29.75859 4.106667 18.64834
## 2 30.05549 4.292812 10.41857
## 3 30.21689 3.623649 13.17182
## 4 29.53086 4.243638 10.13750
## 5 30.08582 3.659540 12.39026
## 6 30.10121 4.003995 17.06495
dat2 <- dat %>% mutate_at(.vars = vars(c("y","z")),.funs = funs(scale(.) %>% as.vector))
head(dat2)
## x y z
## 1 29.75859 0.31437541 1.6461394
## 2 30.05549 0.73248271 -1.1141179
## 3 30.21689 -0.77055077 -0.1906803
## 4 29.53086 0.62203161 -1.2083897
## 5 30.08582 -0.68993299 -0.4528163
## 6 30.10121 0.08375944 1.1150708
data("USArrests")
str(USArrests)
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
set.seed(123)
data("USArrests")
df <- USArrests%>%sample_n(15)
df_scaled <- df%>%mutate_all(.funs =funs(scale(.) %>% as.vector) )
row.names(df_scaled) <- row.names(df)
df_scaled
## Murder Assault UrbanPop Rape
## New Mexico 0.58508090 1.02300309 0.22505574 0.61101857
## Iowa -1.70220419 -1.54760088 -0.68923319 -1.43885018
## Indiana -0.45911447 -0.90775622 -0.12659385 -0.48290177
## Arizona -0.23535832 1.12403120 0.92835492 0.50261205
## Tennessee 1.03259320 -0.06585536 -0.54857336 0.09855138
## Texas 0.90828422 0.08007413 0.92835492 -0.03942055
## Oregon -1.03093574 -0.39139036 0.01406598 0.33507470
## West Virginia -0.83204139 -1.26696726 -1.95517172 -1.63595295
## Missouri -0.01160217 -0.17810880 0.22505574 0.22666818
## Montana -0.75745600 -0.95265760 -0.97055287 -0.93623813
## Nebraska -1.18010651 -1.03123501 -0.33758361 -0.92638299
## California -0.01160217 0.92197499 1.70198401 1.44870532
## South Carolina 1.33093473 0.95565102 -1.32220246 -0.33507470
## Nevada 0.78397525 0.65256671 0.99868483 1.98088278
## Florida 1.57955267 1.58427034 0.92835492 0.59130829
dist_eucl <- dist(df_scaled,method = "euclidean")
dist_eucl
## New Mexico Iowa Indiana Arizona Tennessee Texas
## Iowa 4.1082312
## Indiana 2.4775986 1.7846775
## Arizona 1.0907378 3.9591819 2.5024750
## Tennessee 1.4990012 3.4724739 1.8574561 2.3169189
## Texas 1.3824891 3.7468614 2.0384260 1.6405987 1.4956943
## Oregon 2.1754724 2.3299921 1.1324681 1.9476753 2.1763803 2.2268866
## West Virginia 4.1297061 1.5739722 2.2229105 4.3544334 3.1474682 3.9631926
## Missouri 1.3951436 2.8885151 1.1661044 1.5219757 1.3106718 1.2158464
## Montana 3.0869279 1.2562907 1.0043926 3.2033362 2.2890064 2.8725177
## Nebraska 3.1647982 0.9620290 0.8810612 3.0302391 2.6311604 2.8258957
## California 1.8025742 4.7971697 3.2592923 1.2587645 2.9923364 2.0899781
## South Carolina 1.9621250 4.1334214 2.8509464 2.8719947 1.3852857 2.4693498
## Nevada 1.6284385 5.0562305 3.3639807 1.8578349 2.5524734 2.1047024
## Florida 1.3412730 5.2265362 3.5543219 1.8744570 2.3337108 1.7638102
## Oregon West Virginia Missouri Montana Nebraska California
## Iowa
## Indiana
## Arizona
## Tennessee
## Texas
## Oregon
## West Virginia 2.9272924
## Missouri 1.0680817 3.1751350
## Montana 1.7249703 1.2503714 1.9844533
## Nebraska 1.4651232 1.8157092 1.9337313 0.7652157
## California 2.6178636 5.3248937 2.2101756 4.1110770 3.8705700
## South Carolina 3.1028481 3.4221970 2.4077748 2.9134297 3.4017884 3.7591374
## Nevada 2.8393497 5.3012159 2.2357853 4.1641622 4.1146826 1.2179355
## Florida 3.4087336 5.2173518 2.5030673 4.2233709 4.2851574 2.0746182
## South Carolina Nevada
## Iowa
## Indiana
## Arizona
## Tennessee
## Texas
## Oregon
## West Virginia
## Missouri
## Montana
## Nebraska
## California
## South Carolina
## Nevada 3.3378439
## Florida 2.5258993 1.8538820
Reformat as Matrix
round(as.matrix(dist_eucl),1)
## New Mexico Iowa Indiana Arizona Tennessee Texas Oregon
## New Mexico 0.0 4.1 2.5 1.1 1.5 1.4 2.2
## Iowa 4.1 0.0 1.8 4.0 3.5 3.7 2.3
## Indiana 2.5 1.8 0.0 2.5 1.9 2.0 1.1
## Arizona 1.1 4.0 2.5 0.0 2.3 1.6 1.9
## Tennessee 1.5 3.5 1.9 2.3 0.0 1.5 2.2
## Texas 1.4 3.7 2.0 1.6 1.5 0.0 2.2
## Oregon 2.2 2.3 1.1 1.9 2.2 2.2 0.0
## West Virginia 4.1 1.6 2.2 4.4 3.1 4.0 2.9
## Missouri 1.4 2.9 1.2 1.5 1.3 1.2 1.1
## Montana 3.1 1.3 1.0 3.2 2.3 2.9 1.7
## Nebraska 3.2 1.0 0.9 3.0 2.6 2.8 1.5
## California 1.8 4.8 3.3 1.3 3.0 2.1 2.6
## South Carolina 2.0 4.1 2.9 2.9 1.4 2.5 3.1
## Nevada 1.6 5.1 3.4 1.9 2.6 2.1 2.8
## Florida 1.3 5.2 3.6 1.9 2.3 1.8 3.4
## West Virginia Missouri Montana Nebraska California
## New Mexico 4.1 1.4 3.1 3.2 1.8
## Iowa 1.6 2.9 1.3 1.0 4.8
## Indiana 2.2 1.2 1.0 0.9 3.3
## Arizona 4.4 1.5 3.2 3.0 1.3
## Tennessee 3.1 1.3 2.3 2.6 3.0
## Texas 4.0 1.2 2.9 2.8 2.1
## Oregon 2.9 1.1 1.7 1.5 2.6
## West Virginia 0.0 3.2 1.3 1.8 5.3
## Missouri 3.2 0.0 2.0 1.9 2.2
## Montana 1.3 2.0 0.0 0.8 4.1
## Nebraska 1.8 1.9 0.8 0.0 3.9
## California 5.3 2.2 4.1 3.9 0.0
## South Carolina 3.4 2.4 2.9 3.4 3.8
## Nevada 5.3 2.2 4.2 4.1 1.2
## Florida 5.2 2.5 4.2 4.3 2.1
## South Carolina Nevada Florida
## New Mexico 2.0 1.6 1.3
## Iowa 4.1 5.1 5.2
## Indiana 2.9 3.4 3.6
## Arizona 2.9 1.9 1.9
## Tennessee 1.4 2.6 2.3
## Texas 2.5 2.1 1.8
## Oregon 3.1 2.8 3.4
## West Virginia 3.4 5.3 5.2
## Missouri 2.4 2.2 2.5
## Montana 2.9 4.2 4.2
## Nebraska 3.4 4.1 4.3
## California 3.8 1.2 2.1
## South Carolina 0.0 3.3 2.5
## Nevada 3.3 0.0 1.9
## Florida 2.5 1.9 0.0
library(factoextra)
dist_cor <- df_scaled%>%get_dist(method = "pearson")
dist_cor
## New Mexico Iowa Indiana Arizona Tennessee
## Iowa 1.7430861618
## Indiana 1.9993000724 0.2798608443
## Arizona 0.8119961242 0.4874315369 1.2185300116
## Tennessee 0.7564479997 1.8110595149 1.2156955577 1.9000652111
## Texas 1.6844417363 0.6148005836 0.3215807247 1.3936309359 0.8289046023
## Oregon 1.2427044146 0.4643223537 0.7617665205 0.4299952340 1.7480149191
## West Virginia 0.4628693388 1.8753742240 1.5204303589 1.6473835569 0.0937305723
## Missouri 1.8284117502 0.3750934828 0.1633286827 1.0745135169 1.3452076912
## Montana 0.9809849228 1.6227971884 0.9954198056 1.9331290835 0.0423571758
## Nebraska 1.7302885598 0.0009410002 0.2924233724 0.4677819240 1.8291745502
## California 1.3732491609 0.2225017310 0.6429238919 0.2752917769 1.9225257476
## South Carolina 0.2703243683 1.9157384353 1.7170235624 1.4545938981 0.2048075869
## Nevada 1.2326614512 0.9330099254 0.7462494020 1.0820535163 1.1310128634
## Florida 0.4763110416 1.5012254468 1.5315920518 1.1657053188 0.5203506282
## Texas Oregon West Virginia Missouri
## Iowa
## Indiana
## Arizona
## Tennessee
## Texas
## Oregon 1.5044687364
## West Virginia 0.9170141633 1.8758632323
## Missouri 0.8305350501 0.3369293621 1.7084277042
## Montana 0.5524565684 1.8246261083 0.1545107779 1.2608930868
## Nebraska 0.6537292870 0.4286215632 1.8954640888 0.3634413119
## California 1.2700494274 0.0663202604 1.9812441989 0.3726086638
## South Carolina 1.1178823291 1.7941587197 0.0319871933 1.8334231301
## Nevada 1.4884400736 0.2484736333 1.4468610163 0.2721427896
## Florida 0.7383727570 1.8941865505 0.2212999751 1.9038779975
## Montana Nebraska California South Carolina
## Iowa
## Indiana
## Arizona
## Tennessee
## Texas
## Oregon
## West Virginia
## Missouri
## Montana
## Nebraska 1.6505674570
## California 1.9150654892 0.1960392444
## South Carolina 0.3144558967 1.9291196168 1.9066213856
## Nevada 1.2588761126 0.9001823525 0.5039891103 1.4604702638
## Florida 0.4909321936 1.5295984156 1.7797512047 0.1944633685
## Nevada
## Iowa
## Indiana
## Arizona
## Tennessee
## Texas
## Oregon
## West Virginia
## Missouri
## Montana
## Nebraska
## California
## South Carolina
## Nevada
## Florida 1.8968746957
round(as.matrix(dist_cor),1)
## New Mexico Iowa Indiana Arizona Tennessee Texas Oregon
## New Mexico 0.0 1.7 2.0 0.8 0.8 1.7 1.2
## Iowa 1.7 0.0 0.3 0.5 1.8 0.6 0.5
## Indiana 2.0 0.3 0.0 1.2 1.2 0.3 0.8
## Arizona 0.8 0.5 1.2 0.0 1.9 1.4 0.4
## Tennessee 0.8 1.8 1.2 1.9 0.0 0.8 1.7
## Texas 1.7 0.6 0.3 1.4 0.8 0.0 1.5
## Oregon 1.2 0.5 0.8 0.4 1.7 1.5 0.0
## West Virginia 0.5 1.9 1.5 1.6 0.1 0.9 1.9
## Missouri 1.8 0.4 0.2 1.1 1.3 0.8 0.3
## Montana 1.0 1.6 1.0 1.9 0.0 0.6 1.8
## Nebraska 1.7 0.0 0.3 0.5 1.8 0.7 0.4
## California 1.4 0.2 0.6 0.3 1.9 1.3 0.1
## South Carolina 0.3 1.9 1.7 1.5 0.2 1.1 1.8
## Nevada 1.2 0.9 0.7 1.1 1.1 1.5 0.2
## Florida 0.5 1.5 1.5 1.2 0.5 0.7 1.9
## West Virginia Missouri Montana Nebraska California
## New Mexico 0.5 1.8 1.0 1.7 1.4
## Iowa 1.9 0.4 1.6 0.0 0.2
## Indiana 1.5 0.2 1.0 0.3 0.6
## Arizona 1.6 1.1 1.9 0.5 0.3
## Tennessee 0.1 1.3 0.0 1.8 1.9
## Texas 0.9 0.8 0.6 0.7 1.3
## Oregon 1.9 0.3 1.8 0.4 0.1
## West Virginia 0.0 1.7 0.2 1.9 2.0
## Missouri 1.7 0.0 1.3 0.4 0.4
## Montana 0.2 1.3 0.0 1.7 1.9
## Nebraska 1.9 0.4 1.7 0.0 0.2
## California 2.0 0.4 1.9 0.2 0.0
## South Carolina 0.0 1.8 0.3 1.9 1.9
## Nevada 1.4 0.3 1.3 0.9 0.5
## Florida 0.2 1.9 0.5 1.5 1.8
## South Carolina Nevada Florida
## New Mexico 0.3 1.2 0.5
## Iowa 1.9 0.9 1.5
## Indiana 1.7 0.7 1.5
## Arizona 1.5 1.1 1.2
## Tennessee 0.2 1.1 0.5
## Texas 1.1 1.5 0.7
## Oregon 1.8 0.2 1.9
## West Virginia 0.0 1.4 0.2
## Missouri 1.8 0.3 1.9
## Montana 0.3 1.3 0.5
## Nebraska 1.9 0.9 1.5
## California 1.9 0.5 1.8
## South Carolina 0.0 1.5 0.2
## Nevada 1.5 0.0 1.9
## Florida 0.2 1.9 0.0
Gower’s metric
library(cluster)
data(flower)
head(flower,3)
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 0 1 1 4 3 15 25 15
## 2 1 0 0 2 1 3 150 50
## 3 0 1 0 3 3 1 150 50
str(flower)
## 'data.frame': 18 obs. of 8 variables:
## $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
## $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
## $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
## $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
## $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
## $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
## $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
## $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
dd <- daisy(flower)
dd
## Dissimilarities :
## 1 2 3 4 5 6 7
## 2 0.8875408
## 3 0.5272467 0.5147059
## 4 0.3517974 0.5504493 0.5651552
## 5 0.4115605 0.6226307 0.3726307 0.6383578
## 6 0.2269199 0.6606209 0.3003268 0.4189951 0.3443627
## 7 0.2876225 0.5999183 0.4896242 0.3435866 0.4197712 0.1892974
## 8 0.4234069 0.4641340 0.6038399 0.2960376 0.4673203 0.5714869 0.4107843
## 9 0.5808824 0.4316585 0.4463644 0.8076797 0.3306781 0.5136846 0.5890931
## 10 0.6094363 0.4531046 0.4678105 0.5570670 0.3812908 0.4119281 0.5865196
## 11 0.3278595 0.7096814 0.5993873 0.6518791 0.3864788 0.4828840 0.5652369
## 12 0.4267565 0.5857843 0.6004902 0.5132761 0.5000817 0.5248366 0.6391340
## 13 0.5196487 0.5248366 0.5395425 0.7464461 0.2919118 0.4524510 0.5278595
## 14 0.2926062 0.5949346 0.6096405 0.3680147 0.5203431 0.3656863 0.5049837
## 15 0.6221814 0.3903595 0.5300654 0.5531454 0.4602124 0.5091503 0.3345588
## 16 0.6935866 0.3575163 0.6222222 0.3417892 0.7301471 0.5107843 0.4353758
## 17 0.7765114 0.1904412 0.5801471 0.4247141 0.6880719 0.5937092 0.5183007
## 18 0.4610294 0.4515114 0.7162173 0.4378268 0.4755310 0.6438317 0.4692402
## 8 9 10 11 12 13 14
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9 0.6366422
## 10 0.6639706 0.4256127
## 11 0.4955474 0.4308007 0.3948121
## 12 0.4216503 0.4194036 0.3812092 0.2636029
## 13 0.5754085 0.2181781 0.3643791 0.3445670 0.2331699
## 14 0.4558007 0.4396650 0.3609477 0.2838644 0.1591503 0.3784314
## 15 0.4512255 0.2545343 0.4210784 0.4806781 0.4295752 0.3183007 0.4351307
## 16 0.6378268 0.6494690 0.3488562 0.7436683 0.6050654 0.5882353 0.4598039
## 17 0.4707516 0.6073938 0.3067810 0.7015931 0.5629902 0.5461601 0.5427288
## 18 0.1417892 0.5198529 0.8057598 0.5359477 0.5495507 0.5733252 0.5698121
## 15 16 17
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16 0.3949346
## 17 0.3528595 0.1670752
## 18 0.5096814 0.7796160 0.6125408
##
## Metric : mixed ; Types = N, N, N, N, O, O, I, I
## Number of objects : 18
round(as.matrix(dd),2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1 0.00 0.89 0.53 0.35 0.41 0.23 0.29 0.42 0.58 0.61 0.33 0.43 0.52 0.29 0.62
## 2 0.89 0.00 0.51 0.55 0.62 0.66 0.60 0.46 0.43 0.45 0.71 0.59 0.52 0.59 0.39
## 3 0.53 0.51 0.00 0.57 0.37 0.30 0.49 0.60 0.45 0.47 0.60 0.60 0.54 0.61 0.53
## 4 0.35 0.55 0.57 0.00 0.64 0.42 0.34 0.30 0.81 0.56 0.65 0.51 0.75 0.37 0.55
## 5 0.41 0.62 0.37 0.64 0.00 0.34 0.42 0.47 0.33 0.38 0.39 0.50 0.29 0.52 0.46
## 6 0.23 0.66 0.30 0.42 0.34 0.00 0.19 0.57 0.51 0.41 0.48 0.52 0.45 0.37 0.51
## 7 0.29 0.60 0.49 0.34 0.42 0.19 0.00 0.41 0.59 0.59 0.57 0.64 0.53 0.50 0.33
## 8 0.42 0.46 0.60 0.30 0.47 0.57 0.41 0.00 0.64 0.66 0.50 0.42 0.58 0.46 0.45
## 9 0.58 0.43 0.45 0.81 0.33 0.51 0.59 0.64 0.00 0.43 0.43 0.42 0.22 0.44 0.25
## 10 0.61 0.45 0.47 0.56 0.38 0.41 0.59 0.66 0.43 0.00 0.39 0.38 0.36 0.36 0.42
## 11 0.33 0.71 0.60 0.65 0.39 0.48 0.57 0.50 0.43 0.39 0.00 0.26 0.34 0.28 0.48
## 12 0.43 0.59 0.60 0.51 0.50 0.52 0.64 0.42 0.42 0.38 0.26 0.00 0.23 0.16 0.43
## 13 0.52 0.52 0.54 0.75 0.29 0.45 0.53 0.58 0.22 0.36 0.34 0.23 0.00 0.38 0.32
## 14 0.29 0.59 0.61 0.37 0.52 0.37 0.50 0.46 0.44 0.36 0.28 0.16 0.38 0.00 0.44
## 15 0.62 0.39 0.53 0.55 0.46 0.51 0.33 0.45 0.25 0.42 0.48 0.43 0.32 0.44 0.00
## 16 0.69 0.36 0.62 0.34 0.73 0.51 0.44 0.64 0.65 0.35 0.74 0.61 0.59 0.46 0.39
## 17 0.78 0.19 0.58 0.42 0.69 0.59 0.52 0.47 0.61 0.31 0.70 0.56 0.55 0.54 0.35
## 18 0.46 0.45 0.72 0.44 0.48 0.64 0.47 0.14 0.52 0.81 0.54 0.55 0.57 0.57 0.51
## 16 17 18
## 1 0.69 0.78 0.46
## 2 0.36 0.19 0.45
## 3 0.62 0.58 0.72
## 4 0.34 0.42 0.44
## 5 0.73 0.69 0.48
## 6 0.51 0.59 0.64
## 7 0.44 0.52 0.47
## 8 0.64 0.47 0.14
## 9 0.65 0.61 0.52
## 10 0.35 0.31 0.81
## 11 0.74 0.70 0.54
## 12 0.61 0.56 0.55
## 13 0.59 0.55 0.57
## 14 0.46 0.54 0.57
## 15 0.39 0.35 0.51
## 16 0.00 0.17 0.78
## 17 0.17 0.00 0.61
## 18 0.78 0.61 0.00
library(factoextra)
fviz_dist(dist_eucl)
data("USArrests")
df <- USArrests #%>%sample_n(15)
df_scaled <- df%>%mutate_all(.funs =funs(scale(.) %>% as.vector) )
row.names(df_scaled) <- row.names(df)
df_scaled
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.78283935 -0.52090661 -0.003416473
## Alaska 0.50786248 1.10682252 -1.21176419 2.484202941
## Arizona 0.07163341 1.47880321 0.99898006 1.042878388
## Arkansas 0.23234938 0.23086801 -1.07359268 -0.184916602
## California 0.27826823 1.26281442 1.75892340 2.067820292
## Colorado 0.02571456 0.39885929 0.86080854 1.864967207
## Connecticut -1.03041900 -0.72908214 0.79172279 -1.081740768
## Delaware -0.43347395 0.80683810 0.44629400 -0.579946294
## Florida 1.74767144 1.97077766 0.99898006 1.138966691
## Georgia 2.20685994 0.48285493 -0.38273510 0.487701523
## Hawaii -0.57123050 -1.49704226 1.20623733 -0.110181255
## Idaho -1.19113497 -0.60908837 -0.79724965 -0.750769945
## Illinois 0.59970018 0.93883125 1.20623733 0.295524916
## Indiana -0.13500142 -0.69308401 -0.03730631 -0.024769429
## Iowa -1.28297267 -1.37704849 -0.58999237 -1.060387812
## Kansas -0.41051452 -0.66908525 0.03177945 -0.345063775
## Kentucky 0.43898421 -0.74108152 -0.93542116 -0.526563903
## Louisiana 1.74767144 0.93883125 0.03177945 0.103348309
## Maine -1.30593210 -1.05306531 -1.00450692 -1.434064548
## Maryland 0.80633501 1.55079947 0.10086521 0.701231086
## Massachusetts -0.77786532 -0.26110644 1.34440885 -0.526563903
## Michigan 0.99001041 1.01082751 0.58446551 1.480613993
## Minnesota -1.16817555 -1.18505846 0.03177945 -0.676034598
## Mississippi 1.90838741 1.05882502 -1.48810723 -0.441152078
## Missouri 0.27826823 0.08687549 0.30812248 0.743936999
## Montana -0.41051452 -0.74108152 -0.86633540 -0.515887425
## Nebraska -0.80082475 -0.82507715 -0.24456358 -0.505210947
## Nevada 1.01296983 0.97482938 1.06806582 2.644350114
## New Hampshire -1.30593210 -1.36504911 -0.65907813 -1.252564419
## New Jersey -0.08908257 -0.14111267 1.62075188 -0.259651949
## New Mexico 0.82929443 1.37080881 0.30812248 1.160319648
## New York 0.76041616 0.99882813 1.41349461 0.519730957
## North Carolina 1.19664523 1.99477641 -1.41902147 -0.547916860
## North Dakota -1.60440462 -1.50904164 -1.48810723 -1.487446939
## Ohio -0.11204199 -0.60908837 0.65355127 0.017936483
## Oklahoma -0.27275797 -0.23710769 0.16995096 -0.131534211
## Oregon -0.66306820 -0.14111267 0.10086521 0.861378259
## Pennsylvania -0.34163624 -0.77707965 0.44629400 -0.676034598
## Rhode Island -1.00745957 0.03887798 1.48258036 -1.380682157
## South Carolina 1.51807718 1.29881255 -1.21176419 0.135377743
## South Dakota -0.91562187 -1.01706718 -1.41902147 -0.900240639
## Tennessee 1.24256408 0.20686926 -0.45182086 0.605142783
## Texas 1.12776696 0.36286116 0.99898006 0.455672088
## Utah -1.05337842 -0.60908837 0.99898006 0.178083656
## Vermont -1.28297267 -1.47304350 -2.31713632 -1.071064290
## Virginia 0.16347111 -0.17711080 -0.17547783 -0.056798864
## Washington -0.86970302 -0.30910395 0.51537975 0.530407436
## West Virginia -0.47939280 -1.07706407 -1.83353601 -1.273917376
## Wisconsin -1.19113497 -1.41304662 0.03177945 -1.113770203
## Wyoming -0.22683912 -0.11711392 -0.38273510 -0.601299251
library(factoextra)
fviz_nbclust(df_scaled,kmeans,method = "wss")+
geom_vline(xintercept = 4,linetype=2)
set.seed(123)
km_res <- kmeans(df_scaled,4,nstart = 25)
print(km_res)
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 0.6950701 1.0394414 0.7226370 1.27693964
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 4 3 3 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 2 4 3 2
## Kansas Kentucky Louisiana Maine Maryland
## 3 2 1 2 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 4 2 1 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 4 2 3
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 1 2 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 4 3 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 11.952463 16.212213 19.922437
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
the_us_arrests <- USArrests
the_us_arrests$state <- rownames(USArrests)
the_clusters <- data.frame(state = names(km_res$cluster),cluster=km_res$cluster)
the_us_arrests <- left_join(the_us_arrests,the_clusters)
the_us_arrests
## Murder Assault UrbanPop Rape state cluster
## 1 13.2 236 58 21.2 Alabama 1
## 2 10.0 263 48 44.5 Alaska 4
## 3 8.1 294 80 31.0 Arizona 4
## 4 8.8 190 50 19.5 Arkansas 1
## 5 9.0 276 91 40.6 California 4
## 6 7.9 204 78 38.7 Colorado 4
## 7 3.3 110 77 11.1 Connecticut 3
## 8 5.9 238 72 15.8 Delaware 3
## 9 15.4 335 80 31.9 Florida 4
## 10 17.4 211 60 25.8 Georgia 1
## 11 5.3 46 83 20.2 Hawaii 3
## 12 2.6 120 54 14.2 Idaho 2
## 13 10.4 249 83 24.0 Illinois 4
## 14 7.2 113 65 21.0 Indiana 3
## 15 2.2 56 57 11.3 Iowa 2
## 16 6.0 115 66 18.0 Kansas 3
## 17 9.7 109 52 16.3 Kentucky 2
## 18 15.4 249 66 22.2 Louisiana 1
## 19 2.1 83 51 7.8 Maine 2
## 20 11.3 300 67 27.8 Maryland 4
## 21 4.4 149 85 16.3 Massachusetts 3
## 22 12.1 255 74 35.1 Michigan 4
## 23 2.7 72 66 14.9 Minnesota 2
## 24 16.1 259 44 17.1 Mississippi 1
## 25 9.0 178 70 28.2 Missouri 4
## 26 6.0 109 53 16.4 Montana 2
## 27 4.3 102 62 16.5 Nebraska 2
## 28 12.2 252 81 46.0 Nevada 4
## 29 2.1 57 56 9.5 New Hampshire 2
## 30 7.4 159 89 18.8 New Jersey 3
## 31 11.4 285 70 32.1 New Mexico 4
## 32 11.1 254 86 26.1 New York 4
## 33 13.0 337 45 16.1 North Carolina 1
## 34 0.8 45 44 7.3 North Dakota 2
## 35 7.3 120 75 21.4 Ohio 3
## 36 6.6 151 68 20.0 Oklahoma 3
## 37 4.9 159 67 29.3 Oregon 3
## 38 6.3 106 72 14.9 Pennsylvania 3
## 39 3.4 174 87 8.3 Rhode Island 3
## 40 14.4 279 48 22.5 South Carolina 1
## 41 3.8 86 45 12.8 South Dakota 2
## 42 13.2 188 59 26.9 Tennessee 1
## 43 12.7 201 80 25.5 Texas 4
## 44 3.2 120 80 22.9 Utah 3
## 45 2.2 48 32 11.2 Vermont 2
## 46 8.5 156 63 20.7 Virginia 3
## 47 4.0 145 73 26.2 Washington 3
## 48 5.7 81 39 9.3 West Virginia 2
## 49 2.6 53 66 10.8 Wisconsin 2
## 50 6.8 161 60 15.6 Wyoming 3
the_us_arrests%>%group_by(cluster)%>%summarise_if(.predicate = is.numeric,.funs = mean)
## # A tibble: 4 x 5
## cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 13.9 244. 53.8 21.4
## 2 2 3.6 78.5 52.1 12.2
## 3 3 5.66 139. 73.9 18.8
## 4 4 10.8 257. 76 33.2
fviz_cluster(km_res,data = df,
palette=c("#C1FFC1", "#FFB6C1", "#98F5FF", "#FFD700"),
ellipse.type = "euclid",star.plot=TRUE,repel = TRUE,ggtheme = theme_minimal())
K-medoids utilize the median to remove the impact of outliers on the cluster.
data("USArrests")
df <- USArrests #%>%sample_n(15)
df_scaled <- df%>%mutate_all(.funs =funs(scale(.) %>% as.vector) )
row.names(df_scaled) <- row.names(df)
df_scaled
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.78283935 -0.52090661 -0.003416473
## Alaska 0.50786248 1.10682252 -1.21176419 2.484202941
## Arizona 0.07163341 1.47880321 0.99898006 1.042878388
## Arkansas 0.23234938 0.23086801 -1.07359268 -0.184916602
## California 0.27826823 1.26281442 1.75892340 2.067820292
## Colorado 0.02571456 0.39885929 0.86080854 1.864967207
## Connecticut -1.03041900 -0.72908214 0.79172279 -1.081740768
## Delaware -0.43347395 0.80683810 0.44629400 -0.579946294
## Florida 1.74767144 1.97077766 0.99898006 1.138966691
## Georgia 2.20685994 0.48285493 -0.38273510 0.487701523
## Hawaii -0.57123050 -1.49704226 1.20623733 -0.110181255
## Idaho -1.19113497 -0.60908837 -0.79724965 -0.750769945
## Illinois 0.59970018 0.93883125 1.20623733 0.295524916
## Indiana -0.13500142 -0.69308401 -0.03730631 -0.024769429
## Iowa -1.28297267 -1.37704849 -0.58999237 -1.060387812
## Kansas -0.41051452 -0.66908525 0.03177945 -0.345063775
## Kentucky 0.43898421 -0.74108152 -0.93542116 -0.526563903
## Louisiana 1.74767144 0.93883125 0.03177945 0.103348309
## Maine -1.30593210 -1.05306531 -1.00450692 -1.434064548
## Maryland 0.80633501 1.55079947 0.10086521 0.701231086
## Massachusetts -0.77786532 -0.26110644 1.34440885 -0.526563903
## Michigan 0.99001041 1.01082751 0.58446551 1.480613993
## Minnesota -1.16817555 -1.18505846 0.03177945 -0.676034598
## Mississippi 1.90838741 1.05882502 -1.48810723 -0.441152078
## Missouri 0.27826823 0.08687549 0.30812248 0.743936999
## Montana -0.41051452 -0.74108152 -0.86633540 -0.515887425
## Nebraska -0.80082475 -0.82507715 -0.24456358 -0.505210947
## Nevada 1.01296983 0.97482938 1.06806582 2.644350114
## New Hampshire -1.30593210 -1.36504911 -0.65907813 -1.252564419
## New Jersey -0.08908257 -0.14111267 1.62075188 -0.259651949
## New Mexico 0.82929443 1.37080881 0.30812248 1.160319648
## New York 0.76041616 0.99882813 1.41349461 0.519730957
## North Carolina 1.19664523 1.99477641 -1.41902147 -0.547916860
## North Dakota -1.60440462 -1.50904164 -1.48810723 -1.487446939
## Ohio -0.11204199 -0.60908837 0.65355127 0.017936483
## Oklahoma -0.27275797 -0.23710769 0.16995096 -0.131534211
## Oregon -0.66306820 -0.14111267 0.10086521 0.861378259
## Pennsylvania -0.34163624 -0.77707965 0.44629400 -0.676034598
## Rhode Island -1.00745957 0.03887798 1.48258036 -1.380682157
## South Carolina 1.51807718 1.29881255 -1.21176419 0.135377743
## South Dakota -0.91562187 -1.01706718 -1.41902147 -0.900240639
## Tennessee 1.24256408 0.20686926 -0.45182086 0.605142783
## Texas 1.12776696 0.36286116 0.99898006 0.455672088
## Utah -1.05337842 -0.60908837 0.99898006 0.178083656
## Vermont -1.28297267 -1.47304350 -2.31713632 -1.071064290
## Virginia 0.16347111 -0.17711080 -0.17547783 -0.056798864
## Washington -0.86970302 -0.30910395 0.51537975 0.530407436
## West Virginia -0.47939280 -1.07706407 -1.83353601 -1.273917376
## Wisconsin -1.19113497 -1.41304662 0.03177945 -1.113770203
## Wyoming -0.22683912 -0.11711392 -0.38273510 -0.601299251
There are 2 packages that support PAM. * cluster * fpc
install.packages("cluster","fpc")
library(cluster)
library(factoextra)
fviz_nbclust(df_scaled, pam, method = "silhouette") +
theme_classic()
pam_res <- pam(df_scaled, 2)
pam_res
## Medoids:
## ID Murder Assault UrbanPop Rape
## New Mexico 31 0.8292944 1.3708088 0.3081225 1.1603196
## Nebraska 27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
## Objective function:
## build swap
## 1.441358 1.368969
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
the_us_arrests <- USArrests
the_us_arrests$state <- rownames(USArrests)
the_clusters <- data.frame(state = names(pam_res$cluster),cluster=pam_res$cluster)
the_us_arrests <- left_join(the_us_arrests,the_clusters)
the_us_arrests
## Murder Assault UrbanPop Rape state cluster
## 1 13.2 236 58 21.2 Alabama 1
## 2 10.0 263 48 44.5 Alaska 1
## 3 8.1 294 80 31.0 Arizona 1
## 4 8.8 190 50 19.5 Arkansas 2
## 5 9.0 276 91 40.6 California 1
## 6 7.9 204 78 38.7 Colorado 1
## 7 3.3 110 77 11.1 Connecticut 2
## 8 5.9 238 72 15.8 Delaware 2
## 9 15.4 335 80 31.9 Florida 1
## 10 17.4 211 60 25.8 Georgia 1
## 11 5.3 46 83 20.2 Hawaii 2
## 12 2.6 120 54 14.2 Idaho 2
## 13 10.4 249 83 24.0 Illinois 1
## 14 7.2 113 65 21.0 Indiana 2
## 15 2.2 56 57 11.3 Iowa 2
## 16 6.0 115 66 18.0 Kansas 2
## 17 9.7 109 52 16.3 Kentucky 2
## 18 15.4 249 66 22.2 Louisiana 1
## 19 2.1 83 51 7.8 Maine 2
## 20 11.3 300 67 27.8 Maryland 1
## 21 4.4 149 85 16.3 Massachusetts 2
## 22 12.1 255 74 35.1 Michigan 1
## 23 2.7 72 66 14.9 Minnesota 2
## 24 16.1 259 44 17.1 Mississippi 1
## 25 9.0 178 70 28.2 Missouri 1
## 26 6.0 109 53 16.4 Montana 2
## 27 4.3 102 62 16.5 Nebraska 2
## 28 12.2 252 81 46.0 Nevada 1
## 29 2.1 57 56 9.5 New Hampshire 2
## 30 7.4 159 89 18.8 New Jersey 2
## 31 11.4 285 70 32.1 New Mexico 1
## 32 11.1 254 86 26.1 New York 1
## 33 13.0 337 45 16.1 North Carolina 1
## 34 0.8 45 44 7.3 North Dakota 2
## 35 7.3 120 75 21.4 Ohio 2
## 36 6.6 151 68 20.0 Oklahoma 2
## 37 4.9 159 67 29.3 Oregon 2
## 38 6.3 106 72 14.9 Pennsylvania 2
## 39 3.4 174 87 8.3 Rhode Island 2
## 40 14.4 279 48 22.5 South Carolina 1
## 41 3.8 86 45 12.8 South Dakota 2
## 42 13.2 188 59 26.9 Tennessee 1
## 43 12.7 201 80 25.5 Texas 1
## 44 3.2 120 80 22.9 Utah 2
## 45 2.2 48 32 11.2 Vermont 2
## 46 8.5 156 63 20.7 Virginia 2
## 47 4.0 145 73 26.2 Washington 2
## 48 5.7 81 39 9.3 West Virginia 2
## 49 2.6 53 66 10.8 Wisconsin 2
## 50 6.8 161 60 15.6 Wyoming 2
pam_res$medoids
## Murder Assault UrbanPop Rape
## New Mexico 0.8292944 1.3708088 0.3081225 1.1603196
## Nebraska -0.8008247 -0.8250772 -0.2445636 -0.5052109
pam_res$clustering
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
fviz_cluster(pam_res,palete = c("#00CDCD", "#FFA07A"),ellipse.type = "t",
repel = TRUE, ggtheme = theme_classic())
library(fpc)
pamk_res <- fpc::pamk(data = df_scaled)
pamk_res
## $pamobject
## Medoids:
## ID Murder Assault UrbanPop Rape
## New Mexico 31 0.8292944 1.3708088 0.3081225 1.1603196
## Nebraska 27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
## Objective function:
## build swap
## 1.441358 1.368969
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
##
## $nc
## [1] 2
##
## $crit
## [1] 0.0000000 0.4084890 0.3143656 0.3389904 0.3105170 0.2629987 0.2243815
## [8] 0.2386072 0.2466113 0.2447023
the_us_arrests <- USArrests
the_us_arrests$state <- rownames(USArrests)
the_clusters <- data.frame(state = names(pamk_res$pamobject$clustering),cluster=pamk_res$pamobject$clustering)
the_us_arrests <- left_join(the_us_arrests,the_clusters)
the_us_arrests
## Murder Assault UrbanPop Rape state cluster
## 1 13.2 236 58 21.2 Alabama 1
## 2 10.0 263 48 44.5 Alaska 1
## 3 8.1 294 80 31.0 Arizona 1
## 4 8.8 190 50 19.5 Arkansas 2
## 5 9.0 276 91 40.6 California 1
## 6 7.9 204 78 38.7 Colorado 1
## 7 3.3 110 77 11.1 Connecticut 2
## 8 5.9 238 72 15.8 Delaware 2
## 9 15.4 335 80 31.9 Florida 1
## 10 17.4 211 60 25.8 Georgia 1
## 11 5.3 46 83 20.2 Hawaii 2
## 12 2.6 120 54 14.2 Idaho 2
## 13 10.4 249 83 24.0 Illinois 1
## 14 7.2 113 65 21.0 Indiana 2
## 15 2.2 56 57 11.3 Iowa 2
## 16 6.0 115 66 18.0 Kansas 2
## 17 9.7 109 52 16.3 Kentucky 2
## 18 15.4 249 66 22.2 Louisiana 1
## 19 2.1 83 51 7.8 Maine 2
## 20 11.3 300 67 27.8 Maryland 1
## 21 4.4 149 85 16.3 Massachusetts 2
## 22 12.1 255 74 35.1 Michigan 1
## 23 2.7 72 66 14.9 Minnesota 2
## 24 16.1 259 44 17.1 Mississippi 1
## 25 9.0 178 70 28.2 Missouri 1
## 26 6.0 109 53 16.4 Montana 2
## 27 4.3 102 62 16.5 Nebraska 2
## 28 12.2 252 81 46.0 Nevada 1
## 29 2.1 57 56 9.5 New Hampshire 2
## 30 7.4 159 89 18.8 New Jersey 2
## 31 11.4 285 70 32.1 New Mexico 1
## 32 11.1 254 86 26.1 New York 1
## 33 13.0 337 45 16.1 North Carolina 1
## 34 0.8 45 44 7.3 North Dakota 2
## 35 7.3 120 75 21.4 Ohio 2
## 36 6.6 151 68 20.0 Oklahoma 2
## 37 4.9 159 67 29.3 Oregon 2
## 38 6.3 106 72 14.9 Pennsylvania 2
## 39 3.4 174 87 8.3 Rhode Island 2
## 40 14.4 279 48 22.5 South Carolina 1
## 41 3.8 86 45 12.8 South Dakota 2
## 42 13.2 188 59 26.9 Tennessee 1
## 43 12.7 201 80 25.5 Texas 1
## 44 3.2 120 80 22.9 Utah 2
## 45 2.2 48 32 11.2 Vermont 2
## 46 8.5 156 63 20.7 Virginia 2
## 47 4.0 145 73 26.2 Washington 2
## 48 5.7 81 39 9.3 West Virginia 2
## 49 2.6 53 66 10.8 Wisconsin 2
## 50 6.8 161 60 15.6 Wyoming 2
plot(pamk_res$pamobject)
CLARA extends the k-medoids algorithm to handle big data. It utilizes sampling to handle the large data sets.
Minimize Sampling Bias by running multiple samples and comparing the results. Each sample medoid is measured by the avergage dissimilarity of each object.
set.seed(1234)
df_clust_1 <- data.frame(x=rnorm(n = 200,mean = 0,sd = 8),y=rnorm(n = 200,mean = 0,sd = 8))
df_clust_2 <- data.frame(x=rnorm(n = 300,mean = 50,sd = 8),y=rnorm(n = 300,mean = 50,sd = 8))
df <- rbind(df_clust_1,df_clust_2)
rownames(df) <- paste0("S",1:nrow(df))
head(df)
## x y
## S1 -9.656526 3.881815
## S2 2.219434 5.574150
## S3 8.675529 1.484111
## S4 -18.765582 5.605868
## S5 3.432998 2.493448
## S6 4.048447 6.083699
Packages: cluster, factoextra functions: clara()
Utilize factoextra::fviz_nbclust() function to determine the correct number of clusters.
library(cluster)
library(factoextra)
fviz_nbclust(df,clara,method = "silhouette") +
theme_classic()
The number of clusters created in the simulated data frame is 2. The results from the optimal cluster determination is 2.
clara_res <- clara(df, 2, samples = 50, pamLike = TRUE)
knitr::knit_print(clara_res)
## Call: clara(x = df, k = 2, samples = 50, pamLike = TRUE)
## Medoids:
## x y
## S121 -1.531137 1.145057
## S455 48.357304 50.233499
## Objective function: 9.87862
## Clustering vector: Named int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "names")= chr [1:500] "S1" "S2" "S3" "S4" "S5" "S6" "S7" ...
## Cluster sizes: 200 300
## Best sample:
## [1] S37 S49 S54 S63 S68 S71 S76 S80 S82 S101 S103 S108 S109 S118 S121
## [16] S128 S132 S138 S144 S162 S203 S210 S216 S231 S234 S249 S260 S261 S286 S299
## [31] S304 S305 S312 S315 S322 S350 S403 S450 S454 S455 S456 S465 S488 S497
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
df$id <- rownames(df)
df <- df%>%select(id,x,y)
clustering <- data.frame(id= names(clara_res$clustering), cluster = clara_res$clustering)
df <- left_join(df,clustering)
head(df)
## id x y cluster
## 1 S1 -9.656526 3.881815 1
## 2 S2 2.219434 5.574150 1
## 3 S3 8.675529 1.484111 1
## 4 S4 -18.765582 5.605868 1
## 5 S5 3.432998 2.493448 1
## 6 S6 4.048447 6.083699 1
clara_res$medoids
## x y
## S121 -1.531137 1.145057
## S455 48.357304 50.233499
fviz_cluster(clara_res,palette=c("#008B8B", "#EE3B3B"),ellipse.type = "t",geom = "point",ggtheme = theme_classic()
)
The CLARA algorithm is an extension to the PAM clustering method for large data sets. You must specify the number of clusters.
Also known as Hierarchial cluster analysis (HCA)
There are two types of Hierachial Clusting
Agglomearative is from the bottom up and Divisive is from the top down.
The term dendogram is used to describe the hierarchial structure of clustering.
This clustering method initially assigns each observation as its own cluster (leaf), and then iterates to find common leafs that will match together to create the next level of cluster.
set.seed(123)
data("USArrests")
df <- USArrests
df_scaled <- df%>%mutate_all(.funs =funs(scale(.) %>% as.vector) )
row.names(df_scaled) <- row.names(df)
head(df_scaled)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
res_dist <- df_scaled%>%dist(method = "euclidian")
head(res_dist)
## [1] 2.703754 2.293520 1.289810 3.263110 2.651067 3.215297
res_hc <- hclust(d = res_dist,method = "ward.D2")
library(factoextra)
fviz_dend(res_hc, cex = 0.8)
res_coph <- cophenetic(res_hc)
cor(res_dist,res_coph)
## [1] 0.6975266
res_hc2 <- hclust(res_dist, method="average")
cor(res_dist,cophenetic(res_hc2))
## [1] 0.7180382
grp <- cutree(res_hc,k = 4)
grp
## Alabama Alaska Arizona Arkansas California
## 1 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
table(grp)
## grp
## 1 2 3 4
## 7 12 19 12
str(grp)
## Named int [1:50] 1 2 2 3 2 2 3 3 2 1 ...
## - attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
fviz_dend(res_hc,k = 4, cex = 0.5, k_colors = c("cornflowerblue", "aquamarine", "darkorange2", "darkkhaki"),
color_labels_by_k = TRUE,
rect = TRUE)
fviz_cluster(list(data=df,cluster = grp),
palette = c("cornflowerblue", "aquamarine", "darkorange2", "darkkhaki"),
ellipse.type = "convex",
repel = TRUE,
show.clust.cent = FALSE,
ggtheme = theme_minimal())
Agglomerative Nesting (Hierarchial Clustering)
library(cluster)
res_agnes <- agnes(x = USArrests,
stand = TRUE,
metric = "euclidean",
method = "ward")
res_agnes
## Call: agnes(x = USArrests, metric = "euclidean", stand = TRUE, method = "ward")
## Agglomerative coefficient: 0.934098
## Order of objects:
## [1] Alabama Louisiana Georgia Tennessee Mississippi
## [6] South Carolina North Carolina Alaska California Nevada
## [11] Colorado Arizona Maryland New Mexico Michigan
## [16] Florida Illinois New York Texas Arkansas
## [21] Kentucky Virginia Wyoming Indiana Kansas
## [26] Oklahoma Ohio Pennsylvania Missouri Oregon
## [31] Washington Connecticut Rhode Island Delaware Hawaii
## [36] Utah Massachusetts New Jersey Idaho Montana
## [41] Nebraska Iowa New Hampshire Maine Minnesota
## [46] Wisconsin North Dakota Vermont South Dakota West Virginia
## Height (summary):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2620 0.9627 1.3332 2.2033 2.2249 16.3214
##
## Available components:
## [1] "order" "height" "ac" "merge" "diss" "call"
## [7] "method" "order.lab" "data"
DIvisive ANAlysis Clustering
res_diana <- diana(x = USArrests,
stand = TRUE,
metric = "euclidean")
res_diana
## Merge:
## [,1] [,2]
## [1,] -15 -29
## [2,] -13 -32
## [3,] -23 -49
## [4,] -14 -36
## [5,] -20 -31
## [6,] -16 -38
## [7,] -37 -47
## [8,] 1 -19
## [9,] 4 -35
## [10,] -41 -48
## [11,] -46 -50
## [12,] -12 -27
## [13,] -21 -30
## [14,] -24 -40
## [15,] 2 -43
## [16,] -17 -26
## [17,] -1 -42
## [18,] -10 -18
## [19,] 9 6
## [20,] -4 11
## [21,] -11 -44
## [22,] -7 -39
## [23,] 8 -34
## [24,] 10 -45
## [25,] 5 -22
## [26,] 17 18
## [27,] 14 -33
## [28,] 12 3
## [29,] -5 -28
## [30,] -3 25
## [31,] 29 -6
## [32,] 15 -25
## [33,] 30 32
## [34,] 22 13
## [35,] 23 24
## [36,] 19 7
## [37,] 20 -8
## [38,] 34 21
## [39,] 28 16
## [40,] 37 36
## [41,] 26 27
## [42,] 39 35
## [43,] 33 -9
## [44,] 43 31
## [45,] 40 38
## [46,] -2 44
## [47,] 45 42
## [48,] 41 46
## [49,] 48 47
## Order of objects:
## [1] Alabama Tennessee Georgia Louisiana Mississippi
## [6] South Carolina North Carolina Alaska Arizona Maryland
## [11] New Mexico Michigan Illinois New York Texas
## [16] Missouri Florida California Nevada Colorado
## [21] Arkansas Virginia Wyoming Delaware Indiana
## [26] Oklahoma Ohio Kansas Pennsylvania Oregon
## [31] Washington Connecticut Rhode Island Massachusetts New Jersey
## [36] Hawaii Utah Idaho Nebraska Minnesota
## [41] Wisconsin Kentucky Montana Iowa New Hampshire
## [46] Maine North Dakota South Dakota West Virginia Vermont
## Height:
## [1] 1.0340801 1.3687929 1.0408349 2.8205587 0.9771986 1.3960934 5.4623592
## [8] 4.0837483 1.4720236 0.6743186 1.3385319 1.9655812 0.4336501 1.0044328
## [15] 1.8355779 2.9848751 3.0534495 1.4621581 1.7044355 7.4792599 1.2084105
## [22] 0.8788445 2.1852059 2.5927716 0.6281577 0.8500187 1.0553617 0.6765523
## [29] 2.1808837 0.7311305 3.7810250 1.2903008 1.9665793 0.9626545 2.5032675
## [36] 1.2698293 5.3463933 0.9132211 1.4571402 0.6215418 2.5227753 1.0181020
## [43] 2.9549994 0.2619577 0.7930730 1.2918266 2.1082699 0.8721679 1.3129132
## Divisive coefficient:
## [1] 0.8530481
##
## Available components:
## [1] "order" "height" "dc" "merge" "diss" "call"
## [7] "order.lab" "data"
Visualize
library(factoextra)
fviz_dend(res_agnes, cex = 0.6, k = 4)
data("USArrests")
df <- USArrests #%>%sample_n(15)
df_scaled <- df%>%mutate_all(.funs =funs(scale(.) %>% as.vector) )
row.names(df_scaled) <- row.names(df)
df_scaled
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.78283935 -0.52090661 -0.003416473
## Alaska 0.50786248 1.10682252 -1.21176419 2.484202941
## Arizona 0.07163341 1.47880321 0.99898006 1.042878388
## Arkansas 0.23234938 0.23086801 -1.07359268 -0.184916602
## California 0.27826823 1.26281442 1.75892340 2.067820292
## Colorado 0.02571456 0.39885929 0.86080854 1.864967207
## Connecticut -1.03041900 -0.72908214 0.79172279 -1.081740768
## Delaware -0.43347395 0.80683810 0.44629400 -0.579946294
## Florida 1.74767144 1.97077766 0.99898006 1.138966691
## Georgia 2.20685994 0.48285493 -0.38273510 0.487701523
## Hawaii -0.57123050 -1.49704226 1.20623733 -0.110181255
## Idaho -1.19113497 -0.60908837 -0.79724965 -0.750769945
## Illinois 0.59970018 0.93883125 1.20623733 0.295524916
## Indiana -0.13500142 -0.69308401 -0.03730631 -0.024769429
## Iowa -1.28297267 -1.37704849 -0.58999237 -1.060387812
## Kansas -0.41051452 -0.66908525 0.03177945 -0.345063775
## Kentucky 0.43898421 -0.74108152 -0.93542116 -0.526563903
## Louisiana 1.74767144 0.93883125 0.03177945 0.103348309
## Maine -1.30593210 -1.05306531 -1.00450692 -1.434064548
## Maryland 0.80633501 1.55079947 0.10086521 0.701231086
## Massachusetts -0.77786532 -0.26110644 1.34440885 -0.526563903
## Michigan 0.99001041 1.01082751 0.58446551 1.480613993
## Minnesota -1.16817555 -1.18505846 0.03177945 -0.676034598
## Mississippi 1.90838741 1.05882502 -1.48810723 -0.441152078
## Missouri 0.27826823 0.08687549 0.30812248 0.743936999
## Montana -0.41051452 -0.74108152 -0.86633540 -0.515887425
## Nebraska -0.80082475 -0.82507715 -0.24456358 -0.505210947
## Nevada 1.01296983 0.97482938 1.06806582 2.644350114
## New Hampshire -1.30593210 -1.36504911 -0.65907813 -1.252564419
## New Jersey -0.08908257 -0.14111267 1.62075188 -0.259651949
## New Mexico 0.82929443 1.37080881 0.30812248 1.160319648
## New York 0.76041616 0.99882813 1.41349461 0.519730957
## North Carolina 1.19664523 1.99477641 -1.41902147 -0.547916860
## North Dakota -1.60440462 -1.50904164 -1.48810723 -1.487446939
## Ohio -0.11204199 -0.60908837 0.65355127 0.017936483
## Oklahoma -0.27275797 -0.23710769 0.16995096 -0.131534211
## Oregon -0.66306820 -0.14111267 0.10086521 0.861378259
## Pennsylvania -0.34163624 -0.77707965 0.44629400 -0.676034598
## Rhode Island -1.00745957 0.03887798 1.48258036 -1.380682157
## South Carolina 1.51807718 1.29881255 -1.21176419 0.135377743
## South Dakota -0.91562187 -1.01706718 -1.41902147 -0.900240639
## Tennessee 1.24256408 0.20686926 -0.45182086 0.605142783
## Texas 1.12776696 0.36286116 0.99898006 0.455672088
## Utah -1.05337842 -0.60908837 0.99898006 0.178083656
## Vermont -1.28297267 -1.47304350 -2.31713632 -1.071064290
## Virginia 0.16347111 -0.17711080 -0.17547783 -0.056798864
## Washington -0.86970302 -0.30910395 0.51537975 0.530407436
## West Virginia -0.47939280 -1.07706407 -1.83353601 -1.273917376
## Wisconsin -1.19113497 -1.41304662 0.03177945 -1.113770203
## Wyoming -0.22683912 -0.11711392 -0.38273510 -0.601299251
library(dplyr)
set.seed(123)
df_sample <- df_scaled%>%sample_n(10)
head(df_sample)
## Murder Assault UrbanPop Rape
## New Mexico 0.82929443 1.3708088 0.30812248 1.16031965
## Iowa -1.28297267 -1.3770485 -0.58999237 -1.06038781
## Indiana -0.13500142 -0.6930840 -0.03730631 -0.02476943
## Arizona 0.07163341 1.4788032 0.99898006 1.04287839
## Tennessee 1.24256408 0.2068693 -0.45182086 0.60514278
## Texas 1.12776696 0.3628612 0.99898006 0.45567209
library(dendextend)
res_dist <- dist(df_sample, method = "euclidean")
hc1 <- hclust(res_dist, method = "average")
hc2 <- hclust(res_dist, method = "ward.D2")
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
dend_list <- dendlist(dend1,dend2)
tanglegram(dend1,dend2)
tanglegram(dend1,dend2,
highlight_distinct_edges = FALSE,
common_subtrees_color_lines = FALSE,
common_subtrees_color_branches = TRUE,
main = paste("entanglement =",round(entanglement(dend_list),2)))
library(cluster)
cor.dendlist(dend_list,method = "cophenetic")
## [,1] [,2]
## [1,] 1.0000000 0.9925544
## [2,] 0.9925544 1.0000000
cor.dendlist(dend_list,method = "baker")
## [,1] [,2]
## [1,] 1.0000000 0.9895528
## [2,] 0.9895528 1.0000000
cor_cophenetic(dend1,dend2)
## [1] 0.9925544
cor_bakers_gamma(dend1,dend2)
## [1] 0.9895528
dend1 <- df_sample%>%dist%>% hclust("complete" ) %>% as.dendrogram
dend2 <- df_sample%>%dist%>% hclust("single" ) %>% as.dendrogram
dend3 <- df_sample%>%dist%>% hclust("average" ) %>% as.dendrogram
dend4 <- df_sample%>%dist%>% hclust("centroid" ) %>% as.dendrogram
dend_list <- dendlist("Complete"= dend1, "Single" = dend2,
"Average"=dend3, "Centroid"= dend4)
cors <-- cor.dendlist(dend_list)
round(cors, 2)
## Complete Single Average Centroid
## Complete -1.00 -0.46 -0.45 -0.30
## Single -0.46 -1.00 -0.23 -0.17
## Average -0.45 -0.23 -1.00 -0.31
## Centroid -0.30 -0.17 -0.31 -1.00
library(corrplot)
corrplot(cors, "pie","lower")
data("USArrests")
df_scale <- USArrests%>%mutate_all(.funs = funs(scale(.)))
rownames(df_scale) <- rownames(USArrests)
dd <- dist(df_scale, method = "euclidean")
hc <- dd%>%hclust(method = "ward.D2")
hc
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 50
library(factoextra)
fviz_dend(hc, cex = 0.5)
fviz_dend(hc, cex = 0.5,
main = "Dendogram - ward.D2",
xlab = "Objects",
ylab = "Distance",
sub = "")
fviz_dend(hc, k=4,
cex = 0.5,
k_colors = c("#EE3B3B", "#8470FF", "#76EEC6", "#EEC900"),
main = "Dendogram - ward.D2",
xlab = "Objects",
ylab = "Distance",
sub = "",
color_labels_by_k = TRUE,
rect = TRUE,
rect_border = c("#EE3B3B", "#8470FF", "#76EEC6", "#EEC900"),
rect_fill = TRUE)
fviz_dend(hc, k=4,
cex = 0.5,
k_colors = c("#EE3B3B", "#8470FF", "#76EEC6", "#EEC900"),
main = "Dendogram - ward.D2",
xlab = "Objects",
ylab = "Distance",type = "circular",
sub = "",
color_labels_by_k = TRUE,
rect = TRUE,
rect_border = c("#EE3B3B", "#8470FF", "#76EEC6", "#EEC900"),
rect_fill = TRUE)
require(igraph)
fviz_dend(hc, k=4,
k_colors = "jco",
main = "Dendogram - ward.D2",
xlab = "Objects",
ylab = "Distance",
sub = "",
type = "phylogenic",
phylo_layout = "layout.gem",
repel = TRUE)
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
df <- iris%>%select(-Species)
random_df <- apply(df,2, function(x){
runif(length(x),min(x),max(x))
})
random_df <- as.data.frame(random_df)
df <- iris_scaled <- df%>%mutate_all(.funs = funs(scale(.)))%>%as.vector
random_df <- random_df %>% mutate_all(.funs = funs(scale(.))) %>% as.vector
library(factoextra)
fviz_pca_ind(prcomp(df), title= "PCA - Iris data",
habillage = iris$Species, palette = "jco",
geom = "point",
ggtheme = theme_classic(),
legend = "bottom")
fviz_pca_ind(prcomp(random_df), title= "PCA - Random data",
habillage = iris$Species, palette = "jco",
geom = "point",
ggtheme = theme_classic(),
legend = "bottom")
library(factoextra)
set.seed(123)
km_res1 <- kmeans(df, 3)
fviz_cluster(list(data = df, cluster = km_res1$cluster),
ellipse.type = "norm", geom = "point", stand = FALSE,
palette = "jco", ggtheme = theme_classic())
km_res2 <- kmeans(random_df, 3)
fviz_cluster(list(data = df, cluster = km_res2$cluster),
ellipse.type = "norm", geom = "point", stand = FALSE,
palette = "jco", ggtheme = theme_classic())
fviz_dend(hclust(dist(random_df)), k = 3, k_colors = "jco", as.ggplot = TRUE, show_labels = FALSE)
There are two methods recommended for evaluating the clustering tendency:
Hopkins statistical method.
Rejecting the Null hypothesis occurs if the Hopkins statistic is close to zero.
Non-Random Data Set
library(clustertend)
set.seed(123)
hopkins(df,n = nrow(df)-1)
## $H
## [1] 0.1815219
Null Hypothesis is rejected. Clustering is possible.
Random Data Set
set.seed(123)
hopkins(random_df,n = nrow(random_df) - 1)
## $H
## [1] 0.5145653
Null Hypothesis is TRUE. Clustering is not possible
fviz_dist(dist(df), show_labels = FALSE) +
labs(title = "Iris data")
fviz_dist(dist(random_df), show_labels = FALSE) +
labs(title = "Random data")
There is no specific method for determing the optimal number of clusters. There are about 30 algorithms that can be used to project the best number of clusters. The following 3 methods are fequently consumed:
Two functions to use:
1) factoextra::fviz_nbclust() 2) NbClust::NbClust()
library(factoextra)
library(NbClust)
df <- USArrests %>% mutate_all(.funs = funs(scale(.))) %>% as.vector
head(df)
## Murder Assault UrbanPop Rape
## 1 1.24256408 0.7828393 -0.5209066 -0.003416473
## 2 0.50786248 1.1068225 -1.2117642 2.484202941
## 3 0.07163341 1.4788032 0.9989801 1.042878388
## 4 0.23234938 0.2308680 -1.0735927 -0.184916602
## 5 0.27826823 1.2628144 1.7589234 2.067820292
## 6 0.02571456 0.3988593 0.8608085 1.864967207
Elbow Method (WSS)
fviz_nbclust(df,kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)
Silhouette Method
fviz_nbclust(df, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method")
Gap Statitic Method
set.seed(123)
fviz_nbclust(df,kmeans, nstart = 25, method = "gap_stat", nboot =50) +
labs(subtitle = "Gap statitic method")